Skip to content

Conversation

@cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Nov 19, 2025

This PR enables testing cufile on a larger variety of filesystems where the root filesystem (/) might have a different mount point than the mount point of the checkout directory, by looking at all of /proc/mounts instead of only the first matching mount point.

@cpcloud cpcloud requested review from leofang and mdboom November 19, 2025 19:07
@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Nov 19, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cpcloud cpcloud changed the title fix incorrect cufile skipping logic test: fix overly restrictive cufile skipping logic Nov 19, 2025
@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

/ok to test

@github-actions
Copy link

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

/ok to test

1 similar comment
@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

/ok to test

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

Turns out these tests are never run on main because they are skipped. See https://github.com/NVIDIA/cuda-python/actions/runs/19510167362/job/55859660002 and search for cuda.bindings, you'll see this:

image

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

The logic I have here is I believe correct, but I'm not sure if these tests are just broken, or this is a latent bug that we've been living with for a while.

@rwgk
Copy link
Collaborator

rwgk commented Nov 19, 2025

Turns out these tests are never run on main because they are skipped.

Tangential:

Yeah, I ran into that twice with the ctk-next testing. SWQA was the first to run in environments where they are not skipped, but I didn't think of checking first if they actually ran in the cuda-python CI and started debugging painstakingly assuming they did.

It'd be awesome to have some sort of automatic flagging, after all jobs finish, to report which tests

  • are skipped everywhere
  • are always skipped on linux-64, linux-aarch64, win-64

@leofang
Copy link
Member

leofang commented Nov 19, 2025

  1. For cuFILE related changes, let us always rope in the GDS team for consultation (cc @sourabgupta3).
  2. Currently lots of things cannot be run in the CI ([FEA]: Improve CI coverage of cuFile #1143), not even in cuFILE's compat mode IIUC.

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

It's really an issue of incorrect skipping logic, not whether a thing runs in CI. This should have been running, because the runners have a compatible file system.

This wasn't running, but it was unintentionally not running.

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

Given all the failures happen on 13.0.2, and the 12.9.x tests all pass, that seems like a good place to start digging.

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

Ah, well those are not running the cuda.bindings tests, so that wouldn't explain the failure.

@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

/ok to test

1 similar comment
@cpcloud
Copy link
Contributor Author

cpcloud commented Nov 19, 2025

/ok to test

@leofang
Copy link
Member

leofang commented Nov 19, 2025

Let's give Sourab some time. I pinged him offline and he's looking into it.

@leofang
Copy link
Member

leofang commented Nov 19, 2025

Let's give Sourab some time. I pinged him offline and he's looking into it.

Ping'd you both in an internal thread.

@leofang leofang added bug Something isn't working triage Needs the team's attention test Improvements or additions to tests cuda.bindings Everything related to the cuda.bindings module labels Nov 20, 2025
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

What's blocking this PR? This PR is only about fixing the testing problem, and it shouldn't be blocked on any fixes/problems that were discovered here. This PR isn't changing any existing behavior, except that now the tests that were skipped when they should've been running (because the skipping criteria checking function was broken) are now xfailing.

We can merge this PR without waiting for cufile fixes, and we can continue the discussion on that elsewhere.

@cpcloud cpcloud force-pushed the fix-incorrect-cufile-skipping-logic branch from 15cbb8a to 40450d2 Compare December 2, 2025 16:59
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

/ok to test

Copy link
Member

@leofang leofang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocking part is that the new xfail_handle_register decorator was not needed and it is puzzling. It is unclear to me what triggered this need. Can we please engage in the internal thread and understand it better before merging?

@cpcloud cpcloud force-pushed the fix-incorrect-cufile-skipping-logic branch from 40450d2 to 0cd211f Compare December 2, 2025 18:16
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

/ok to test

1 similar comment
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

/ok to test

@cpcloud cpcloud force-pushed the fix-incorrect-cufile-skipping-logic branch from d27a08f to 0cd211f Compare December 2, 2025 18:27
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

/ok to test

@cpcloud cpcloud force-pushed the fix-incorrect-cufile-skipping-logic branch from 0cd211f to 27b3825 Compare December 2, 2025 18:47
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

/ok to test

@cpcloud cpcloud force-pushed the fix-incorrect-cufile-skipping-logic branch from 27b3825 to 778239c Compare December 2, 2025 19:04
@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 2, 2025

/ok to test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.bindings Everything related to the cuda.bindings module test Improvements or additions to tests triage Needs the team's attention

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants